Created by the Dream Team: Daniel Ryan, Josh Hascall, and Zach Suzuki
The classification task is to determine which flight class (Eco, Eco Plus, Business) a customer who has filled out the survey belongs in. The business case is to help the airline understand which customers are best to follow up with, given their satisfaction with the flight. We are operating under the assumption that the passenger fills out the survey but isn't asked to include their class, or that future regulations prevent us from asking this question. If the latter scenario occurs, we would still have the data to train and test a logistic regression model.
We think that estimating the class of the respondent is worth training a model on because we can target different types and amounts of advertising content based on this estimate. It is our belief that airlines are most interested in retaining and improving relationships with business class clients over economy clients, simply because business class clients pay more for their flights; in business jargon, business class fliers have the highest LTV (lifetime value). Business class tickets are often worth orders of magnitude more than economy tickets, which means the return on investment of advertising to a business class client and getting them to fly with the same airline again is much higher than for retaining an economy customer, even if we must spend more on advertising to acquire and retain these customers. The ultimate goal is to maximize the LTV / CAC (customer acquisition cost) ratio in order to get the most profit. By understanding which customers we are marketing to, we gain a better understanding of the potential LTV, and a better idea of how much we can spend to acquire that customer.
We want to note that this is an extremely difficult classification task given the low correlation of these survey responses with class. In other words, it is very valuable if we can train a logistic regression algorithm to classify this effectively, precisely because it is difficult to do so, and success yields a higher LTV/CAC.
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from scipy.special import expit
from sklearn.impute import KNNImputer
import copy
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import itertools as it # needed by the one-vs-one extension (fit_OvO) below
df = pd.read_csv("APS.csv", index_col=0)
df.head()
| | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | ... | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
| 1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | ... | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
| 2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | ... | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | satisfied |
| 3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | ... | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | neutral or dissatisfied |
| 4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | ... | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | satisfied |
5 rows × 24 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103904 entries, 0 to 103903
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype
---  ------                             --------------   -----
 0   id                                 103904 non-null  int64
 1   Gender                             103904 non-null  object
 2   Customer Type                      103904 non-null  object
 3   Age                                103904 non-null  int64
 4   Type of Travel                     103904 non-null  object
 5   Class                              103904 non-null  object
 6   Flight Distance                    103904 non-null  int64
 7   Inflight wifi service              103904 non-null  int64
 8   Departure/Arrival time convenient  103904 non-null  int64
 9   Ease of Online booking             103904 non-null  int64
 10  Gate location                      103904 non-null  int64
 11  Food and drink                     103904 non-null  int64
 12  Online boarding                    103904 non-null  int64
 13  Seat comfort                       103904 non-null  int64
 14  Inflight entertainment             103904 non-null  int64
 15  On-board service                   103904 non-null  int64
 16  Leg room service                   103904 non-null  int64
 17  Baggage handling                   103904 non-null  int64
 18  Checkin service                    103904 non-null  int64
 19  Inflight service                   103904 non-null  int64
 20  Cleanliness                        103904 non-null  int64
 21  Departure Delay in Minutes         103904 non-null  int64
 22  Arrival Delay in Minutes           103594 non-null  float64
 23  satisfaction                       103904 non-null  object
dtypes: float64(1), int64(18), object(5)
memory usage: 19.8+ MB
# get object for imputation
knn_obj = KNNImputer(n_neighbors=3)
features_to_use = ['Age','Flight Distance','Departure/Arrival time convenient','Departure Delay in Minutes', 'Arrival Delay in Minutes']
# create a numpy matrix from pandas numeric values to impute
temp = df[features_to_use].to_numpy()
# use sklearn imputation object
knn_obj.fit(temp)
temp_imputed = knn_obj.transform(temp)
# Deep copy
df_imputed = copy.deepcopy(df) # not just an alias
df_imputed[features_to_use] = temp_imputed
df_imputed.dropna(inplace=True)
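As a sanity check on how KNNImputer fills gaps, here is a minimal, self-contained sketch on a tiny synthetic matrix (the values below are made up, not from APS.csv): the missing entry is replaced by the mean of that column across the k nearest rows, with distances computed over the non-missing features.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny synthetic matrix (illustrative values only)
X = np.array([
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, np.nan],  # the missing entry we want filled
    [9.0, 8.0],
])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Rows 0 and 1 are the two nearest neighbors of row 2 (identical first feature),
# so the NaN becomes the mean of their second-column values: (2 + 4) / 2 = 3
print(X_filled[2, 1])  # 3.0
```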
There are some things to remove from our dataset because they should have no bearing on class prediction. We're going to leave in general reviews of flight amenities and experiences because there's a chance these improve the likelihood of a correct prediction (e.g. a passenger who gave food service 5 stars is probably more likely to be in business class).
1. overall satisfaction
2. Arrival and departure delay
- Delay in minutes says nothing about which class a person flew in. We will leave in the satisfaction a respondent had with their arrival/departure times, as business travelers could be more likely to respond a certain way.
3. id
- These numbers carry no information: there are no duplicate IDs, and in our testing and PCA analysis they seemed to interfere with results.
#del df_imputed['Class']
del df_imputed['Departure Delay in Minutes']
del df_imputed['Arrival Delay in Minutes']
del df_imputed['id']
del df_imputed['satisfaction']
# mapping categorical data to integers in order to run correlation
df_imputed['Customer Type'] = df_imputed['Customer Type'].map({'disloyal Customer': 0, 'Loyal Customer': 1})
df_imputed['Gender'] = df_imputed['Gender'].map({'Male': 0, 'Female': 1})
df_imputed['Type of Travel'] = df_imputed['Type of Travel'].map({'Personal Travel': 0, 'Business travel': 1})
# convert Class to ranked integers for easier analysis
df_imputed['Class'] = df_imputed['Class'].map({'Eco': 1, 'Eco Plus': 2, 'Business': 3})
# normalizing Age and Flight distance using a max absolute value scaling
def MaxAbsScaler(col):
return col / col.abs().max()
# normalization
df_imputed['Age'] = MaxAbsScaler(df_imputed['Age'])
df_imputed['Flight Distance'] = MaxAbsScaler(df_imputed['Flight Distance'])
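To illustrate what the max-absolute scaling does, here is a small self-contained example on a toy Series. The ages are hypothetical; if the dataset's oldest passenger were 85, this would reproduce the scaled Age values visible in the df_imputed preview (e.g. 13/85 = 0.152941...).

```python
import pandas as pd

def MaxAbsScaler(col):
    # divide by the column's largest absolute value so results land in [-1, 1]
    return col / col.abs().max()

# Hypothetical ages (assumes a maximum age of 85 for illustration)
ages = pd.Series([13, 25, 61, 85])
scaled = MaxAbsScaler(ages)
print(scaled.tolist())  # [0.15294..., 0.29411..., 0.71764..., 1.0]
```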
df_imputed.head()
| | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.152941 | 0 | 2 | 0.092314 | 3 | 4.0 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 |
| 1 | 0 | 0 | 0.294118 | 1 | 3 | 0.047160 | 3 | 2.0 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 |
| 2 | 1 | 1 | 0.305882 | 1 | 3 | 0.229179 | 2 | 2.0 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 |
| 3 | 1 | 1 | 0.294118 | 1 | 3 | 0.112783 | 2 | 5.0 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 |
| 4 | 0 | 1 | 0.717647 | 1 | 3 | 0.042946 | 3 | 3.0 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 |
Just to visualize and discuss, we're going to print the correlations between the remaining fields and our passenger class.
# print the correlation of all values and sort them from largest to smallest
print('Correlation to Class by percentage')
print((df_imputed.corr()['Class'].sort_values(ascending=False)))
vars_to_use = [
'Class',
'Departure/Arrival time convenient',
'Ease of Online booking',
'Gate location',
'Food and drink',
'Online boarding',
'Seat comfort',
'Inflight entertainment',
'On-board service',
'Leg room service',
'Baggage handling',
'Checkin service',
'Inflight service',
'Cleanliness',
]
plt.pcolor(df_imputed[vars_to_use].corr())
plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use, rotation=90)
plt.colorbar()
plt.show()
Correlation to Class by percentage
Class                                1.000000
Type of Travel                       0.545257
Flight Distance                      0.451211
Online boarding                      0.322924
Seat comfort                         0.227444
On-board service                     0.209505
Leg room service                     0.204964
Inflight entertainment               0.194366
Baggage handling                     0.160460
Inflight service                     0.156353
Checkin service                      0.151613
Age                                  0.140565
Cleanliness                          0.135818
Ease of Online booking               0.106391
Customer Type                        0.105735
Food and drink                       0.085908
Inflight wifi service                0.036279
Gate location                        0.004150
Gender                              -0.008253
Departure/Arrival time convenient   -0.092788
Name: Class, dtype: float64
Inflight wifi service, Gate location, and Gender all have less than 10% correlation with Class. We're going to remove Gate location because we also see no intuitive reason to keep it, and we're going to remove Gender because its correlation is essentially zero and there is no plausible link between gender and ticket class.
It's also worth noting again that this classification problem is very difficult; there are fairly low correlations between all attributes and class.
#remove gender and gate location
del df_imputed['Gender']
del df_imputed['Gate location']
#splitting our dataset into X and y
y = df_imputed['Class']
X = df_imputed.drop('Class',axis=1)
df_imputed.groupby('Class')['Age'].count()/len(df_imputed)
Class
1    0.449886
2    0.072124
3    0.477989
Name: Age, dtype: float64
This is good to know for future reference: in our testing it's not uncommon for accuracy to land exactly on one of these class percentages. When that happens, training has gone badly and the regression has learned to predict a single class (e.g. "1") every time.
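A quick self-contained illustration of that failure mode: a degenerate model that always predicts the majority class scores exactly the majority class's share of the data (toy numbers chosen to roughly match the proportions above).

```python
import numpy as np

# Toy labels roughly matching the class shares above: 45% Eco, 7% Eco Plus, 48% Business
y = np.array([1] * 45 + [2] * 7 + [3] * 48)
y_pred = np.full_like(y, 3)          # degenerate model: always predict Business
baseline_accuracy = (y == y_pred).mean()
print(baseline_accuracy)  # 0.48 -- any model scoring near this may just be predicting one class
```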
len(X.columns)
17
Even after removing lots of fields, we still have a TON of dimensions in our dataset. This is a double-edged sword: the extra dimensions can make our results more accurate, but they slow down training and prediction. What will we do? We're going to produce and test datasets with both reduced and original dimensions, then discuss the pros and cons of each.
But how many dimensions should we use for PCA?
# this is a scree plot
def plot_explained_variance(pca):
import plotly
from plotly.graph_objs import Bar, Line
from plotly.graph_objs import Scatter, Layout
from plotly.graph_objs.scatter import Marker
from plotly.graph_objs.layout import XAxis, YAxis
plotly.offline.init_notebook_mode() # run at the start of every notebook
explained_var = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)
plotly.offline.iplot({
"data": [Bar(y=explained_var, name='individual explained variance'),
Scatter(y=cum_var_exp, name='cumulative explained variance')
],
"layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
})
pca = PCA(n_components=17)
pca.fit(X)
plot_explained_variance(pca)
We see the "knee" in this dataset around 8 components. Any more and we might as well use the regular dataset! Any less and we worry we'll lose too much accuracy!
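As a hedged illustration of how a scree plot guides this choice, here is a self-contained sketch on synthetic data where the variance is concentrated in two directions: the smallest k whose cumulative explained variance crosses a threshold marks the knee. (The 90% threshold and the data here are illustrative, not the values used for APS.csv.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 6-D synthetic data whose variance is dominated by two directions
X_demo = rng.normal(size=(400, 6))
X_demo[:, 0] *= 10.0   # high-variance direction
X_demo[:, 1] *= 5.0    # moderate-variance direction

pca = PCA().fit(X_demo)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components whose cumulative explained variance reaches 90%
k = int(np.searchsorted(cum_var, 0.90) + 1)
print(k)  # 2 -- the knee sits where the dominant directions are exhausted
```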
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(X) # fit data and then transform it
X_pca = pca.transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# split the PCA set with the same random_state so its rows stay aligned with y_train/y_test
X_train_pca, X_test_pca = train_test_split(X_pca, test_size=0.2, random_state=42)
#confirm size of split data
print("Test: ","{:.2%}".format(len(X_test)/len(df_imputed)))
print("Train: ", "{:.2%}".format(len(X_train)/len(df_imputed)),'\n')
print("Test - pca: ","{:.2%}".format(len(X_test_pca)/len(df_imputed)))
print("Train - pca: ", "{:.2%}".format(len(X_train_pca)/len(df_imputed)))
Test:  20.00%
Train:  80.00%

Test - pca:  20.00%
Train - pca:  80.00%
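One caveat worth illustrating: when two arrays are split by separate train_test_split calls, passing the same random_state is what keeps their rows aligned; otherwise the PCA training rows would not line up with y_train. A minimal sketch with toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # stand-in features; row i is [2i, 2i+1]
X_toy_pca = X_toy * 10                 # stand-in "transformed" features, row-aligned with X_toy
y_toy = np.arange(10)                  # label i belongs to row i

# Same random_state => both calls shuffle rows identically, keeping splits aligned
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr_pca, X_te_pca = train_test_split(X_toy_pca, test_size=0.2, random_state=42)

print(np.array_equal(X_tr * 10, X_tr_pca))  # True -- the same rows landed in each split
```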
If using both L1 and L2 regularization (regularization=3), C must be a tuple (C1, C2); otherwise C is a single value.
Choices for solver: SteepestAscent, StocasticGradientAscent, BFGS.
class SteepestAscent:
def __init__(self, eta, iterations=20, C=0.001, regularization=2):
self.eta = eta
self.iters = iterations
self.reg = regularization
self.C1 = self.C2 = 0
if regularization==1:
self.C1 = C
elif regularization==2:
self.C2 = C
elif regularization==3:
self.C1, self.C2 = C
# store weights internally as self.w_
def __str__(self):
if(hasattr(self,'w_')):
return 'Binary Logistic Regression Object with coefficients:\n' + str(self.w_) # if we have trained the object
else:
return 'Untrained Binary Logistic Regression Object'
# helper, private
@staticmethod
def _add_bias(X):
return np.hstack((np.ones((X.shape[0],1)), X)) # prepend a column of ones as the bias term: [1, X]
@staticmethod
def _sigmoid(theta):
# scipy element wise sigmoid function
return expit(theta) # 1/(1+np.exp(-theta))
# vectorized gradient w L2 regularization
def _get_gradient(self, X, y):
ydiff = y - self.predict_proba(X, add_bias=False).ravel() # y - g(w^T*X)
gradient = np.mean(X * ydiff[:,np.newaxis], axis = 0) # make ydiff column vector and mult through
gradient = gradient.reshape(self.w_.shape)
gradient[1:] += (np.sign(self.w_[1:]) * self.C1) + (-2 * self.w_[1:] * self.C2)
return gradient
# public
def predict_proba(self, X, add_bias=True):
Xb = self._add_bias(X) if add_bias else X
return self._sigmoid(Xb @ self.w_) # return probability of y=1
def predict(self, X):
return (self.predict_proba(X) > 0.5) # return actual prediction
def fit(self, X, y):
Xb = self._add_bias(X)
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1)) # init weight vector to 0s
# max iterations
for _ in range(self.iters):
gradient = self._get_gradient(Xb,y)
self.w_ += gradient*self.eta # w <- w + eta*gradient
# add bc maximizing
class StocasticGradientAscent(SteepestAscent):
def _get_gradient(self, X, y):
idx = int(np.random.rand()*len(y)) # grab random instance
ydiff = y.iloc[idx] - self.predict_proba(X[idx], add_bias = False) # ydiff now scalar
gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and mult through
gradient = gradient.reshape(self.w_.shape)
gradient[1:] += (np.sign(self.w_[1:]) * self.C1) + (-2 * self.w_[1:] * self.C2)
return gradient
from scipy.optimize import fmin_bfgs
from numpy import ma
class BFGS(SteepestAscent):
@staticmethod
def objective_function(w, X, y, C):
g = expit(X @ w)
C1, C2 = C
return -np.sum(ma.log(g[y==1]))-np.sum(ma.log(1-g[y==0])) + C1*sum(np.abs(w)) + C2*sum(w**2)
# -np.sum(y*np.log(g) + (1-y) * np.log(1-g))
@staticmethod
def objective_gradient(w, X, y, C):
g = expit(X @ w)
ydiff = y-g
gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
gradient = gradient.reshape(w.shape)
C1, C2 = C
gradient[1:] += (np.sign(w[1:]) * C1) + (-2 * w[1:] * C2)
return -gradient
def fit(self, X, y):
Xb = self._add_bias(X)
num_samples, num_features = Xb.shape
self.w_ = fmin_bfgs(self.objective_function, # what to optimize
np.zeros((num_features, 1)), # starting point
fprime=self.objective_gradient, # gradient function
args=(Xb, y, (self.C1,self.C2)), # extra args for gradient and objective function
gtol=1e-03, # stopping criteria for gradient, |v_k|
maxiter=self.iters, # stopping criteria iterations
disp=False)
self.w_ = self.w_.reshape((num_features, 1))
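To show fmin_bfgs in isolation, here is a self-contained toy fit of a 1-D logistic model, minimizing the negative log-likelihood numerically (the data, epsilon guard, and starting point are illustrative assumptions, not part of the class above):

```python
import numpy as np
from scipy.optimize import fmin_bfgs
from scipy.special import expit

# Toy 1-D logistic fit: labels flip from 0 to 1 as the feature grows,
# so the learned slope must come out positive
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])  # bias column + feature
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

def neg_log_likelihood(w, X, y):
    g = expit(X @ w)
    eps = 1e-12  # guard against log(0) on (nearly) separable data
    return -np.sum(y * np.log(g + eps) + (1 - y) * np.log(1 - g + eps))

# no fprime supplied here, so fmin_bfgs approximates the gradient numerically
w_opt = fmin_bfgs(neg_log_likelihood, np.zeros(2), args=(X_toy, y_toy), disp=False)
print(w_opt[1] > 0)  # True
```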
# solver: SteepestAscent, StochasticGradientAscent, BFGS
# regularization parameter: 0-none, 1-L1, 2-L2, 3-L1&L2(here C needs to be a tuple as (C1, C2))
class LogisticRegression:
def __init__(self, eta, iterations=20, regularization=2, C=0.001, solver=SteepestAscent):
self.eta = eta
self.iters = iterations
self.C = C
self.solver = solver
self.reg = regularization
self.classifiers_ = []
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = (y==yval) # create a binary problem
# train the binary classifier for this class
blr = self.solver(eta=self.eta, iterations=self.iters, C=self.C, regularization=self.reg)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
    def fit_OvO(self,X,y): # one-vs-one extension (requires itertools imported as it)
        num_samples, num_features = X.shape
        combos = list(it.combinations(np.unique(y), 2)) # every pair of classes
        self.classifiers_ = [] # will fill this array with binary classifiers
        for y_val in combos: # for each pair of classes
            mask = (y == y_val[0]) | (y == y_val[1]) # keep only the rows from this pair
            y_binary = (y[mask] == y_val[0]) # binary target within the pair
            X_binary = X[mask] # matching feature rows
            blr = self.solver(eta=self.eta, iterations=self.iters, C=self.C, regularization=self.reg)
            blr.fit(X_binary,y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
        # save all the weights into one matrix, separate column for each class pair
        # note: predict() above still assumes one-vs-rest, so this path remains unfinished
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
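The one-vs-rest scheme used by fit above can be sketched end-to-end on synthetic data: train one binary sigmoid classifier per class by gradient ascent, then predict with argmax over the per-class probabilities. (The cluster data, learning rate, and iteration count here are illustrative assumptions, not tuned values.)

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
# Three well-separated 2-D clusters standing in for the three classes
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in means])
y_demo = np.repeat([0, 1, 2], 50)
Xb = np.hstack([np.ones((len(X_demo), 1)), X_demo])  # prepend a bias column

def fit_binary(Xb, y_bin, eta=0.1, iters=500):
    # plain gradient ascent on the binary logistic log-likelihood
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        w += eta * Xb.T @ (y_bin - expit(Xb @ w)) / len(y_bin)
    return w

# one binary classifier per class, then argmax over per-class probabilities
W = np.array([fit_binary(Xb, (y_demo == c).astype(float)) for c in (0, 1, 2)])
pred = np.argmax(expit(Xb @ W.T), axis=1)
accuracy = (pred == y_demo).mean()
print(accuracy)
```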
Let's start with SKLearn to get an understanding of where the benchmark is. Let's train and test on a standard SKL liblinear solver
%%time
from sklearn.linear_model import LogisticRegression as SKLogisticRegression
lr_sk = SKLogisticRegression(solver='liblinear') # all params default
lr_sk.fit(X_train,y_train)
yhat = lr_sk.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))
y_hat_test = lr_sk.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.7799766610925977
Testing accuracy of:  0.7787401953707713
CPU times: user 1.77 s, sys: 95.8 ms, total: 1.87 s
Wall time: 1.85 s
~78% accuracy in under 2 seconds! Not bad... Can we do better?
%%time
lr = LogisticRegression(eta=1,
iterations=250,
regularization=0,
C=0.01,
solver=StocasticGradientAscent
)
lr.fit(X_train,y_train)
yhat = lr.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))
y_hat_test = lr.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.4533642914716745
Testing accuracy of:  0.4552235214859728
CPU times: user 142 ms, sys: 184 ms, total: 325 ms
Wall time: 260 ms
Not a great accuracy... Although this is a pretty rudimentary run, and it was fast! Let's change everything and see what our optimal run is!
Running this chunk will take a LONG time. As soon as it stops running, I'll let you know!
import time
import warnings
warnings.filterwarnings("ignore")
def robust_test(X_train,y_train,X_test,y_test):
data = []
for regularization in [0,1,2]:
for eta in range(5):
eta = 1 / 10**eta
for iterations in [10,20,50]:
for C in range(5):
C = 1 / 10**C
for solver in [SteepestAscent, StocasticGradientAscent, BFGS]:
startTime = time.time()
lr = LogisticRegression(eta=eta,iterations=iterations,regularization=regularization,
C=C,solver=solver)
lr.fit(X_train,y_train)
yhat = lr.predict(X_train)
y_hat_test = lr.predict(X_test)
run_time = (time.time() - startTime)
data.append({'reg' : regularization, 'eta' : eta, 'iterations' : iterations, 'C' : C, 'solver' :
solver, 'train_accuracy' : accuracy_score(y_train,yhat), 'test_accuracy' : accuracy_score(y_test,y_hat_test), 'run_time' : run_time})
return data
data = robust_test(X_train,y_train,X_test,y_test)
df_data = pd.DataFrame(data)
df_data.to_csv("run_out2.csv")
data_pca = robust_test(X_train_pca,y_train,X_test_pca,y_test)
df_data_pca = pd.DataFrame(data_pca)
df_data_pca.to_csv("run_out_pca2.csv")
df_tests = pd.read_csv("run_out_robust.csv") # results from the full offline robust run
df_tests.sort_values(by='test_accuracy',ascending=False)[:1]
| | Unnamed: 0 | reg | eta | iterations | C | solver | train_accuracy | test_accuracy | run_time |
|---|---|---|---|---|---|---|---|---|---|
| 268 | 268 | 1 | 0.001 | 50 | 0.0001 | BFGS | 0.779808 | 0.785092 | 20.802865 |
Here is our best-performing logistic regression configuration. Since we plan to run this offline, we're going to turn iterations up to a higher number. This will hurt our run time but should help accuracy (we just want to beat SKLearn at something!)
%%time
warnings.filterwarnings("ignore")
lr = LogisticRegression(eta=0.001,
iterations=250,
regularization=1,
C=0.0001,
solver=BFGS
)
lr.fit(X_train,y_train)
yhat = lr.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))
y_hat_test = lr.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.7800488432804399
Testing accuracy of:  0.7790770415283191
CPU times: user 9.28 s, sys: 11.2 s, total: 20.5 s
Wall time: 20.5 s
%%time
lr_sk = SKLogisticRegression(solver='liblinear') # all params default
lr_sk.fit(X_train,y_train)
yhat = lr_sk.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))
y_hat_test = lr_sk.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.7799766610925977
Testing accuracy of:  0.7787401953707713
CPU times: user 1.74 s, sys: 104 ms, total: 1.85 s
Wall time: 1.84 s
And testing the same method on PCA
%%time
lr.fit(X_train_pca,y_train)
yhat = lr.predict(X_train_pca)
print('Training accuracy of: ',accuracy_score(y_train,yhat))
y_hat_test = lr.predict(X_test_pca)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.47851978393465106
Testing accuracy of:  0.4752899282998893
CPU times: user 978 ms, sys: 1.25 s, total: 2.23 s
Wall time: 2.17 s
%%time
lr_sk.fit(X_train_pca,y_train)
yhat = lr_sk.predict(X_train_pca)
print('Training accuracy of: ',accuracy_score(y_train,yhat))
y_hat_test = lr_sk.predict(X_test_pca)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.47851978393465106
Testing accuracy of:  0.47562677445743706
CPU times: user 313 ms, sys: 55.9 ms, total: 369 ms
Wall time: 300 ms
Ok. Let us discuss these results. We will have in-depth visualizations of the various inputs of our model and their performance in the exceptional work section (we're splitting it up!)
First, we're quite pumped that we beat SKLearn in accuracy, if only by the tiniest margin! However, the run time is unfortunately very high. We will discuss this in greater depth under deployment, but the marginal accuracy benefit of running hundreds of iterations with BFGS (which is very time-consuming) may be outweighed by the substantial time cost. This all depends on the use case.
This also being said, we trained and tested over 100k instances in less than half a minute. While this may be a longer run time than SKLearn, it is a relatively fast run time given the datasets probably won't reach beyond a million at the most.
Shockingly, we believe that the DREAM TEAM's implementation, using the optimal values we derived above, would be better than SKLearn if it were deployed as a machine learning model in this specific business use case.
Reasoning: You may be saying, "ARE YOU INSANE?" Yes, but we actually have a simple rationale for our recommendation in this use case. In our classification problem, accuracy is the most important factor. If we can correctly assess which class a passenger flies in, we can spend a difference of tens to hundreds of dollars on marketing and earn much more from their business (see the LTV/CAC discussion above).
Because our model is trained offline, and it can be trained on 100k+ instances in the time it takes to write a short business memo, we believe training time is negligible for this deployment. Therefore, we advise deploying our implementation over Scikit-learn.
We believe there is some balance between speed and accuracy that our implementation is not finding. While we are sad to say so, a general SKLearn implementation is probably the better balance for most problems, and it is only nominally inferior to our implementation from an accuracy standpoint.
When discussing whether or not we should use PCA, we must consider the tradeoff between speed and accuracy. Unfortunately, while we thought we had found the elusive "knee" in our PCA analysis, our accuracy was extremely subpar. It is possible we found the knee, but a greater number of dimensions is unfortunately required to run logistic regression to a satisfactory level on this dataset.
THIS BEING SAID... We were very surprised by the impressive difference in time between training and testing PCA and the full dataset (which we will do in exceptional work). If the client believes speed is an important consideration, then a PCA implementation with more dimensions could be a perfect balance and still superior to SKL.
For the exceptional work we attempted a one-vs-one extension; the code can be seen in the logistic regression methods, but we could not get it working correctly in time. Instead, we hope our extensive analysis, discussion of PCA implementations, and the robust testing with lots of graphs (below) comprise exceptional work.
You can see this implementation in: fit_OvO
#import robust runs
df_tests = pd.read_csv("run_out_robust.csv")
df_tests_pca = pd.read_csv("run_out_robust_pca.csv")
width = 0.3
names = ['BFGS','SA','SGA']
axis = np.arange(3)
plt.bar(axis-0.15, df_tests.groupby('solver').mean()['test_accuracy'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('solver').mean()['test_accuracy'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Test Accuracy')
plt.xlabel('method')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
plt.clf()
plt.bar(axis-0.15, df_tests.groupby('solver').mean()['run_time'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('solver').mean()['run_time'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Run Time')
plt.xlabel('method')
plt.ylabel('Average Time in Seconds')
plt.legend()
The above charts show a couple of crucial things about our implementation. First, BFGS is vastly superior in accuracy to the other two methods; in fact, SA and SGA are both below 50% accuracy on average. That said, the mean may not be the best summary statistic when what we ultimately want is the single best parameter combination. Let's examine boxplots of the accuracy and run-time distributions by solver.
fig, ax = plt.subplots(figsize=(10,8))
df_tests.groupby('solver').boxplot(column=['train_accuracy','test_accuracy'],ax=ax)
fig, ax = plt.subplots(figsize=(10,8))
df_tests.groupby('solver').boxplot(column=['run_time'],ax=ax)
BFGS has very few outliers and relatively stable performance in terms of accuracy compared to the other two. At the same time, the run time distribution is both much higher than for the other two and quite a wide range. It does appear to us that Steepest Ascent could be used if given the right parameters, if run time is a larger factor than accuracy.
width = 0.3
names = [10,20,50]
axis = np.arange(3)
plt.bar(axis-0.15, df_tests.groupby('iterations').mean()['test_accuracy'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('iterations').mean()['test_accuracy'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Test Accuracy')
plt.xlabel('iterations')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
plt.clf()
plt.bar(axis-0.15, df_tests.groupby('iterations').mean()['run_time'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('iterations').mean()['run_time'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Run Time')
plt.xlabel('iterations')
plt.ylabel('Average Time in Seconds')
plt.legend()
It comes as minimal surprise that the more iterations we run, the better the accuracy and the longer the run time. Accuracy appears to improve asymptotically with iterations, while run time grows linearly.
width = 0.3
names = ['0.0001','0.001','0.01','0.1','1']
axis = np.arange(5)
plt.bar(axis-0.15, df_tests.groupby('C').mean()['test_accuracy'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('C').mean()['test_accuracy'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Test Accuracy')
plt.xlabel('C value')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
plt.clf()
plt.bar(axis-0.15, df_tests.groupby('C').mean()['run_time'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('C').mean()['run_time'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Run Time')
plt.xlabel('C value')
plt.ylabel('Average Time in Seconds')
plt.legend()
C value has minimal bearing on accuracy except for C = 0.1, which is slower AND less accurate than its peers. The smallest C value delivers high average accuracy while minimizing run time.